A superstore is planning its year-end sale and wants to launch a new offer: a gold membership that gives a 20% discount on all purchases for only $499 (normally $999). The offer is valid only for existing customers, and a phone-call campaign to reach them is currently being planned. Management feels the best way to reduce the cost of the campaign is to build a predictive model that classifies the customers who are likely to purchase the offer.
Source of data: https://www.kaggle.com/datasets/ahsan81/superstore-marketing-campaign-dataset
| Variable | Description |
|---|---|
| CUSTOMERS INFORMATION | |
| Response (target) | 1 if customer accepted the offer in the last campaign, 0 otherwise |
| ID | Unique ID of each customer |
| Year_Birth | Customer’s year of birth |
| Dt_Customer | Date of customer’s enrollment with the company |
| Education | Customer’s level of education (Basic, Second Cycle, Graduation, Master, PhD) |
| Marital_Status | Customer’s marital status (Single, Together, Married, Divorced, Other) |
| Kidhome | Number of small children in customer’s household (0, 1, 2) |
| Teenhome | Number of teenagers in customer’s household (0, 1, 2) |
| Income | Customer’s yearly household income |
| Complain | 1 if the customer complained in the last 2 years |
| Recency | Number of days since the last purchase |
| AMOUNT OF MONEY SPENT ON TYPES OF PRODUCTS | |
| MntFishProducts | The amount spent on fish products in the last 2 years |
| MntMeatProducts | The amount spent on meat products in the last 2 years |
| MntFruits | The amount spent on fruits products in the last 2 years |
| MntSweetProducts | The amount spent on sweet products in the last 2 years |
| MntWines | The amount spent on wine products in the last 2 years |
| MntGoldProds | The amount spent on gold products in the last 2 years |
| NUMBER OF PURCHASES MADE THROUGH TYPES OF CHANNELS | |
| NumDealsPurchases | Number of purchases made with discount |
| NumCatalogPurchases | Number of purchases made using catalog (buying goods to be shipped through the mail) |
| NumStorePurchases | Number of purchases made directly in stores |
| NumWebPurchases | Number of purchases made through the company’s website |
| NumWebVisitsMonth | Number of visits to company’s website in the last month |
For this project, we are assuming that the survey ended in December 2014 and we are doing our analyses in January 2015. This will be the basis for transforming some of our date-time variables later.
# overview of data attributes
str(super)
## 'data.frame': 2240 obs. of 22 variables:
## $ Id : int 1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
## $ Year_Birth : int 1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : int 84835 57091 67267 32474 21474 71691 63564 44931 65324 65324 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Dt_Customer : chr "6/16/2014" "6/15/2014" "5/13/2014" "11/5/2014" ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
From a quick look at the dataset, we are not interested in the customers’ birth year or enrollment date as such. However, the age profile of the customers might be related to their response, so we will create a new column called Age from the Year_Birth column. We will also transform the enrollment date into the number of days since the customer joined (counting up to 01/01/2015) in a new column called days_joined. Since Year_Birth and Dt_Customer then contain redundant information, we will remove them from the dataset.
# transform Year_Birth into Age
yr <- 2015
super["Age"] <- yr - super["Year_Birth"]
# set today as 01/01/2015
today <- as.Date("01/01/2015", format = "%m/%d/%Y")
# reformat the date in Dt_Customer into the YYYY-MM-DD format
super$Dt_formatted <- as.POSIXct(super$Dt_Customer, format="%m/%d/%Y")
# calculate the number of days since customers joined
super$days_joined <- as.numeric(difftime(today, super$Dt_formatted), unit="days")
# remove redundant columns (Year_Birth, Dt_Customer and Dt_formatted)
super <- subset(super, select = -c(Year_Birth, Dt_Customer, Dt_formatted))
# descriptive summary of the data
summary(super)
## Id Education Marital_Status Income
## Min. : 0 Length:2240 Length:2240 Min. : 1730
## 1st Qu.: 2828 Class :character Class :character 1st Qu.: 35303
## Median : 5458 Mode :character Mode :character Median : 51382
## Mean : 5592 Mean : 52247
## 3rd Qu.: 8428 3rd Qu.: 68522
## Max. :11191 Max. :666666
## NA's :24
## Kidhome Teenhome Recency MntWines
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:24.00 1st Qu.: 23.75
## Median :0.0000 Median :0.0000 Median :49.00 Median : 173.50
## Mean :0.4442 Mean :0.5062 Mean :49.11 Mean : 303.94
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:74.00 3rd Qu.: 504.25
## Max. :2.0000 Max. :2.0000 Max. :99.00 Max. :1493.00
##
## MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 16.0 1st Qu.: 3.00 1st Qu.: 1.00
## Median : 8.0 Median : 67.0 Median : 12.00 Median : 8.00
## Mean : 26.3 Mean : 166.9 Mean : 37.53 Mean : 27.06
## 3rd Qu.: 33.0 3rd Qu.: 232.0 3rd Qu.: 50.00 3rd Qu.: 33.00
## Max. :199.0 Max. :1725.0 Max. :259.00 Max. :263.00
##
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 0.000
## Median : 24.00 Median : 2.000 Median : 4.000 Median : 2.000
## Mean : 44.02 Mean : 2.325 Mean : 4.085 Mean : 2.662
## 3rd Qu.: 56.00 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.: 4.000
## Max. :362.00 Max. :15.000 Max. :27.000 Max. :28.000
##
## NumStorePurchases NumWebVisitsMonth Response Complain
## Min. : 0.00 Min. : 0.000 Min. :0.0000 Min. :0.000000
## 1st Qu.: 3.00 1st Qu.: 3.000 1st Qu.:0.0000 1st Qu.:0.000000
## Median : 5.00 Median : 6.000 Median :0.0000 Median :0.000000
## Mean : 5.79 Mean : 5.317 Mean :0.1491 Mean :0.009375
## 3rd Qu.: 8.00 3rd Qu.: 7.000 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :13.00 Max. :20.000 Max. :1.0000 Max. :1.000000
##
## Age days_joined
## Min. : 19.00 Min. : 25.67
## 1st Qu.: 38.00 1st Qu.: 366.42
## Median : 45.00 Median : 538.71
## Mean : 46.19 Mean : 537.74
## 3rd Qu.: 56.00 3rd Qu.: 710.92
## Max. :122.00 Max. :1088.67
##
# visualize missing data
vis_miss(super) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
# total number of missing values
cat("\nTotal number of missing values:", sum(is.na(super)))
##
## Total number of missing values: 24
# percentage of missing data
cat("\nPercentage of missing data:", (sum(is.na(super))/nrow(super))*100, "%")
##
## Percentage of missing data: 1.071429 %
There are some missing values in the Income variable. Since income might have an impact on the purchase capability of the customers and the missing data constitute only 1.07% of the entire dataset, we will include this variable in our analysis after removing the missing observations.
# remove missing data
super <- na.omit(super)
cat("Number of missing values:", sum(is.na(super)))
## Number of missing values: 0
The Id column represents the unique ID of each customer, so we can use this column to check for any incorrect duplicate entries. Since there are none, we can go ahead and remove the Id column from our dataset.
# duplicates in Id
sum(duplicated(super$Id))
## [1] 0
# remove the Id column
super <- super[-1]
# count and percentage of each category in Response
response_count <- as.data.frame(table(super$Response))
response_count$percent <- response_count$Freq / sum(response_count$Freq) * 100
# pie chart
ggplot(response_count, aes(x = '', y = Freq, fill = factor(Var1))) +
geom_bar(stat = 'identity', width = 1) +
coord_polar(theta = 'y') +
theme_void() +
scale_fill_manual(values = c('#3F7ED5', '#C7DFF9'),
name = '',
labels = c('refused', 'accepted')) +
labs(title = 'Pie chart of Response') +
theme(legend.text = element_text(size = 10),
legend.position = 'right',
plot.title = element_text(hjust = 0.5, face = 'bold')) +
geom_text(aes(label = paste0(round(percent, 2), '%')),
position = position_stack(vjust = 0.5))
# list of column names
cat_cols <- colnames(super[c('Education', 'Marital_Status', 'Kidhome', 'Teenhome', 'Complain')])
for (col in cat_cols) {
# a data frame with the count of each category in the current column
cat_count <- data.frame(table(super[[col]]))
# a bar plot of the count data
print(ggplot(cat_count, aes(x = Var1, y = Freq)) +
geom_bar(stat = 'identity', fill = '#3F7ED5') +
geom_text(aes(label = Freq), vjust = -0.5) +
theme(axis.ticks = element_blank(),
plot.title= element_text(face = "bold")) +
labs(title = col, x = '', y = "Count"))
}
# name of numerical columns (with more than 20 unique values)
num_cols1 <- super %>%
select_if(function(x) length(unique(x)) > 20) %>%
colnames
# histogram for each column
for (col in num_cols1) {
print(ggplot(super, aes(x = .data[[col]])) +
geom_histogram(binwidth = 1, color = '#3F7ED5')+
labs(title = col, x = '', y = 'Frequency') +
theme(axis.ticks = element_blank(),
plot.title= element_text(face = "bold")))
}
# names of remaining numerical columns (20 or fewer unique values)
num_cols2 <- colnames(super)[!(colnames(super) %in% num_cols1 | colnames(super) %in% cat_cols | colnames(super) %in% "Response")]
# histogram for each numerical column
for (col in num_cols2) {
print(ggplot(super, aes(x = .data[[col]])) +
geom_histogram(binwidth = 1, color = '#3F7ED5')+
labs(title = col, x = '', y = 'Frequency') +
theme(axis.ticks = element_blank(),
plot.title= element_text(face = "bold")))
}
From the boxplots, extreme outliers can be detected in MntMeatProducts, NumWebPurchases, NumCatalogPurchases, Income and Age. For example, Age has 3 entries of over 100 years, and Income has an entry of 666666. Since we have no information on the data collection process or the nature of the data, these outliers cannot be validated, so they are best kept in the dataset: they may represent genuine variation or carry important information.
# types of product
boxplot(super[, c('MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds')],
main = "The amount of money spent on types of products in the last 2 years",
names = c("wines", "fruits", "meat", "fish", "sweet", "gold"),
col = '#3F7ED5'
)
# types of channels
boxplot(super[, c('NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases')],
main = 'Number of purchases made through types of channels',
names = c("discount", "website", "catalog", "store"),
col = '#3F7ED5'
)
# Income
boxplot(super$Income,
main = "Income",
col = '#3F7ED5')
# Recency
boxplot(super$Recency,
main = "Recency",
col = '#3F7ED5')
# NumWebVisitsMonth
boxplot(super$NumWebVisitsMonth,
main = "Number of visits to the company's website in the last month",
col = '#3F7ED5')
# Age
boxplot(super$Age,
main = "Age",
col = '#3F7ED5')
# days_joined
boxplot(super$days_joined,
main = "Number of days since joining the program",
col = '#3F7ED5')
# outliers in Age
for (i in which(super$Age > 100)) {
  cat("Outlier:", super$Age[i], " at observation no.", i, "\n")
}
## Outlier: 122 at observation no. 510
## Outlier: 116 at observation no. 822
## Outlier: 115 at observation no. 2210
# outlier in Income
cat("Outlier in Income:", max(super$Income), " at observation no.", which(super$Income == max(super$Income)))
## Outlier in Income: 666666 at observation no. 523
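Since these points cannot be validated externally, a hedged alternative to scanning boxplots by eye is to flag values outside the 1.5 × IQR whiskers that boxplots themselves use. A minimal sketch (the helper `flag_outliers` is hypothetical and shown on a toy vector; in this report it could be applied as, e.g., `flag_outliers(super$Age)`):

```r
# flag indices of values outside the 1.5 * IQR "whisker" rule used by boxplots
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  which(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

ages <- c(34, 40, 45, 52, 61, 122)  # toy data with one extreme value
flag_outliers(ages)
```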
From the correlation matrix, we can see that Recency has very weak correlations with all other variables. Therefore, we note that Recency should be dropped from our clustering assessment.
# correlation matrix
corr_matrix <- super %>%
select_if(is.numeric) %>%
correlate(diagonal = 1)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
corr_matrix
## # A tibble: 19 × 20
## term Income Kidhome Teenhome Recency MntWines MntFru…¹ MntMe…² MntFis…³
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Income 1 -0.429 0.0191 -3.97e-3 0.579 0.431 0.585 4.39e-1
## 2 Kidhome -0.429 1 -0.0399 1.15e-2 -0.497 -0.373 -0.439 -3.89e-1
## 3 Teenho… 0.0191 -0.0399 1 1.38e-2 0.00375 -0.177 -0.261 -2.05e-1
## 4 Recency -0.00397 0.0115 0.0138 1 e+0 0.0157 -0.00584 0.0225 5.51e-4
## 5 MntWin… 0.579 -0.497 0.00375 1.57e-2 1 0.387 0.569 3.98e-1
## 6 MntFru… 0.431 -0.373 -0.177 -5.84e-3 0.387 1 0.548 5.93e-1
## 7 MntMea… 0.585 -0.439 -0.261 2.25e-2 0.569 0.548 1 5.74e-1
## 8 MntFis… 0.439 -0.389 -0.205 5.51e-4 0.398 0.593 0.574 1 e+0
## 9 MntSwe… 0.441 -0.378 -0.163 2.51e-2 0.390 0.572 0.535 5.84e-1
## 10 MntGol… 0.326 -0.355 -0.0199 1.77e-2 0.393 0.396 0.359 4.27e-1
## 11 NumDea… -0.0831 0.217 0.386 2.12e-3 0.00889 -0.135 -0.121 -1.43e-1
## 12 NumWeb… 0.388 -0.372 0.162 -5.64e-3 0.554 0.302 0.307 3.00e-1
## 13 NumCat… 0.589 -0.505 -0.113 2.41e-2 0.635 0.486 0.734 5.33e-1
## 14 NumSto… 0.529 -0.501 0.0497 -4.34e-4 0.640 0.458 0.486 4.58e-1
## 15 NumWeb… -0.553 0.447 0.131 -1.86e-2 -0.322 -0.419 -0.539 -4.46e-1
## 16 Respon… 0.133 -0.0779 -0.154 -2.00e-1 0.246 0.122 0.238 1.08e-1
## 17 Compla… -0.0272 0.0410 0.00331 1.36e-2 -0.0395 -0.00532 -0.0238 -2.12e-2
## 18 Age 0.162 -0.234 0.351 1.63e-2 0.159 0.0177 0.0337 4.04e-2
## 19 days_j… -0.0167 -0.0569 0.00842 3.08e-2 0.149 0.0596 0.0713 7.80e-2
## # … with 11 more variables: MntSweetProducts <dbl>, MntGoldProds <dbl>,
## # NumDealsPurchases <dbl>, NumWebPurchases <dbl>, NumCatalogPurchases <dbl>,
## # NumStorePurchases <dbl>, NumWebVisitsMonth <dbl>, Response <dbl>,
## # Complain <dbl>, Age <dbl>, days_joined <dbl>, and abbreviated variable
## # names ¹MntFruits, ²MntMeatProducts, ³MntFishProducts
# heatmap of correlation
corr_matrix %>%
rearrange(method = "MDS", absolute = FALSE) %>%
shave() %>%
rplot(shape = 15, colours = c("#D3D3D3", "#3F7ED5")) +
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
axis.text.y = element_text(angle = 0, vjust = 1, hjust = 1),
axis.ticks = element_blank())
In order to perform the clustering procedures, we will subset only the numerical variables; Recency and Response will also be excluded. Hence, the variables taken forward for clustering are:

- Income, NumWebVisitsMonth, Age and days_joined
- MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts and MntGoldProds
- NumDealsPurchases, NumWebPurchases, NumCatalogPurchases and NumStorePurchases
# subset the dataset
df <- super[ , !names(super) %in% c(cat_cols, 'Response', 'Recency')]
cat("Subset df\nDimension:", dim(df), "\n\n")
## Subset df
## Dimension: 2216 14
colnames(df)
## [1] "Income" "MntWines" "MntFruits"
## [4] "MntMeatProducts" "MntFishProducts" "MntSweetProducts"
## [7] "MntGoldProds" "NumDealsPurchases" "NumWebPurchases"
## [10] "NumCatalogPurchases" "NumStorePurchases" "NumWebVisitsMonth"
## [13] "Age" "days_joined"
head(df)
## Income MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 84835 189 104 379 111 189
## 2 57091 464 5 64 7 0
## 3 67267 134 11 59 15 2
## 4 32474 10 0 1 0 0
## 5 21474 6 16 24 11 0
## 6 71691 336 130 411 240 32
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 218 1 4 4
## 2 37 1 7 3
## 3 30 1 3 2
## 4 0 1 1 0
## 5 34 2 3 1
## 6 43 1 4 7
## NumStorePurchases NumWebVisitsMonth Age days_joined
## 1 6 1 45 198.70833
## 2 7 5 54 199.70833
## 3 5 2 57 232.70833
## 4 2 7 48 56.66667
## 5 2 7 26 149.70833
## 6 5 2 57 289.70833
It is generally recommended to standardize the variables before any subsequent analysis: standardization makes variables comparable when they are measured on different scales.
# scaling
df2 <- scale(df)
head(df2)
## Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1 1.2945477 -0.34415060 1.9511513 0.9452513 1.3399009
## 2 0.1924178 0.47107987 -0.5366661 -0.4592226 -0.5595702
## 3 0.5966592 -0.50719670 -0.3858893 -0.4815158 -0.4134571
## 4 -0.7854920 -0.87479153 -0.6623135 -0.7401173 -0.6874192
## 5 -1.2224668 -0.88664943 -0.2602420 -0.6375685 -0.4865136
## 6 0.7724026 0.09162714 2.6045175 1.0879280 3.6959757
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1 3.9435854 3.35874468 -0.6880206 -0.0311165
## 2 -0.6580846 -0.13442434 -0.6880206 1.0633941
## 3 -0.6093897 -0.26951927 -0.6880206 -0.3959534
## 4 -0.6580846 -0.84849756 -0.6880206 -1.1256271
## 5 -0.6580846 -0.19232217 -0.1681932 -0.3959534
## 6 0.1210341 -0.01862868 -0.6880206 -0.0311165
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Age
## 1 0.4540800 0.06121821 -1.7807855 -0.09841872
## 2 0.1124021 0.36883623 -0.1315448 0.65248524
## 3 -0.2292757 -0.24639982 -1.3684753 0.90278656
## 4 -0.9126314 -1.16925390 0.6930755 0.15188260
## 5 -0.5709535 -1.16925390 0.6930755 -1.68366041
## 6 1.4791135 -0.24639982 -1.3684753 0.90278656
## days_joined
## 1 -1.458227
## 2 -1.453926
## 3 -1.311971
## 4 -2.069243
## 5 -1.669009
## 6 -1.066776
Before applying any clustering algorithm to the data set, the first thing to do is assess the clustering tendency, that is, whether clustering is suitable for the data at all.
set.seed(123)
# random data generated from the "df" data set
random_df <- apply(df, 2,
function(x){runif(length(x), min(x), (max(x)))})
random_df <- as.data.frame(random_df)
# standardize the random data sets
random_df <- scale(random_df)
We start by visualizing the data to assess whether it contains any meaningful clusters. As the data contain more than two variables, we reduce the dimensionality with principal component analysis (PCA) in order to create a scatter plot.
From the plots, the superstore data set appears to contain real clusters, whereas the randomly generated data set does not contain any meaningful structure.
# plot the superstore data set
fviz_pca_ind(prcomp(df2), title = "PCA - Superstore data", palette = "jco",
geom = "point", ggtheme = theme_classic())
# plot random data set
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data",
geom = "point", ggtheme = theme_classic())
We can conduct the Hopkins statistic test, using 0.5 as the threshold: if H is at or below 0.5, the data set is unlikely to have statistically significant clusters.
The superstore data set is highly clusterable (H = 0.89, far above the 0.5 threshold), while the random_df data set is not clusterable (H = 0.50).
# compute Hopkins statistic for the superstore dataset
set.seed(123)
res <- get_clust_tendency(df2, n = nrow(df2)-1, graph = FALSE)
res$hopkins_stat
## [1] 0.8915371
# compute Hopkins statistic for a random dataset
set.seed(123)
res.re <- get_clust_tendency(random_df, n = nrow(random_df)-1,
graph = FALSE)
res.re$hopkins_stat
## [1] 0.5009627
For a visual evaluation of cluster tendency, we compute the dissimilarity matrix (DM) between the objects in the data set using the Euclidean distance measure. The dissimilarity matrix images confirm that there is a cluster structure in the original data set but not in the random one.
# superstore data
fviz_dist(dist(df2), show_labels = FALSE)+ labs(title = "Superstore data")
# random data
fviz_dist(dist(random_df), show_labels = FALSE)+ labs(title = "Random data")
Before selecting a clustering algorithm, it should be noted that our dataset contains many outliers. Some algorithms, such as K-means, are sensitive to outliers and may not be a good method to apply here. A more robust, less noise-sensitive alternative to K-means is K-medoids clustering, of which we will consider two variants: Partitioning Around Medoids (PAM) and Clustering Large Applications (CLARA). CLARA is a variation of PAM designed to handle large datasets. With over 2,000 observations, CLARA may be more suitable than PAM; however, because CLARA relies on random sampling, its clustering results may be less accurate than PAM's.
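To make the PAM/CLARA trade-off concrete, here is a minimal, self-contained sketch of the two calls on toy data (the parameter values are illustrative, not the ones used later in this report):

```r
library(cluster)
set.seed(123)
# toy stand-in for a scaled data matrix: two well-separated groups
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
# PAM: exact medoid search over all n observations (O(n^2) dissimilarities)
fit_pam <- pam(x, k = 2)
# CLARA: runs PAM on random subsamples and keeps the best set of medoids,
# trading some accuracy for scalability to thousands of rows
fit_clara <- clara(x, k = 2, samples = 50, pamLike = TRUE)
table(fit_pam$clustering)
table(fit_clara$clustering)
```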
The algorithms evaluated are: hierarchical, K-means, PAM and CLARA. The 3 internal measures used to determine the best clustering algorithm are connectivity, the Dunn index and the silhouette width.
Among the algorithms assessed, hierarchical clustering yields the best results on all 3 internal measures. Regardless of the clustering algorithm, the suggested optimal number of clusters is 2.
# define algorithms to consider
clmethods <- c("hierarchical","kmeans","pam", "clara")
# apply internal measures
intern <- clValid(df2, nClust = 2:6,
clMethods = clmethods, maxitems = 2500, validation = "internal")
# summary
summary(intern)
##
## Clustering Methods:
## hierarchical kmeans pam clara
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 2.9290 7.6448 14.4317 18.7187 18.7187
## Dunn 1.0428 0.5017 0.4574 0.5265 0.5265
## Silhouette 0.8031 0.6418 0.5512 0.5036 0.4938
## kmeans Connectivity 2.9290 7.6448 14.4317 26.3313 26.3313
## Dunn 1.0428 0.5017 0.4574 0.1656 0.1656
## Silhouette 0.8031 0.6418 0.5512 0.3954 0.3924
## pam Connectivity 270.7679 540.5361 846.5698 1007.2290 1173.6619
## Dunn 0.0293 0.0268 0.0246 0.0152 0.0154
## Silhouette 0.3223 0.2391 0.2068 0.1235 0.1103
## clara Connectivity 313.0734 500.9040 512.1988 710.8210 1067.2381
## Dunn 0.0424 0.0208 0.0256 0.0152 0.0134
## Silhouette 0.3411 0.2480 0.2449 0.1367 0.1193
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 2.9290 hierarchical 2
## Dunn 1.0428 hierarchical 2
## Silhouette 0.8031 hierarchical 2
4 measures are considered for stability: the average proportion of non-overlap (APN), the average distance (AD), the average distance between means (ADM) and the figure of merit (FOM).
APN, ADM and FOM range from 0 to 1, with smaller values corresponding to more consistent clustering results. AD ranges from 0 to infinity, and smaller values are also preferred.
The best clustering methods and number of clusters for each stability measure are summarized as follows:
# apply stability measures
stab <- clValid(df2, nClust = 2:6, clMethods = clmethods, maxitems = 2500, validation = "stability")
# display only optimal scores
optimalScores(stab)
## Score Method Clusters
## APN 0.0001931362 hierarchical 3
## AD 3.5938465451 pam 6
## ADM 0.0028772905 hierarchical 3
## FOM 0.7938961611 clara 6
There are many linkage methods for hierarchical clustering. To check whether the clustering under a particular linkage is valid, we compute the correlation between the cophenetic distances and the original distances in the distance matrix. The closer the coefficient is to 1, the more accurately the clustering solution reflects the data; values above 0.75 indicate a clustering solution that is good to move forward with.
We will consider 3 linkage methods: Ward, complete and average. Comparing the correlation coefficients between the original distances and the cophenetic distances under these 3 linkages, the average linkage method is the best choice, with the largest correlation coefficient, 0.85 (> 0.75).
# compute the dissimilarity matrix
res.dist <- dist(df2, method = "euclidean")
# view the first 6*6 matrix
as.matrix(res.dist)[1:6, 1:6]
## 1 2 3 4 5 6
## 1 0.000000 7.191652 6.846000 8.270489 7.924138 5.911885
## 2 7.191652 0.000000 2.328817 3.592500 3.921566 6.054592
## 3 6.846000 2.328817 0.000000 3.136840 3.976533 5.691299
## 4 8.270489 3.592500 3.136840 0.000000 2.298475 7.174513
## 5 7.924138 3.921566 3.976533 2.298475 0.000000 7.142218
## 6 5.911885 6.054592 5.691299 7.174513 7.142218 0.000000
### Ward method
res.hc.ward <- hclust(d = res.dist, method = "ward.D2")
# compute cophenetic distance
res.coph.ward <- cophenetic(res.hc.ward)
# correlation between cophenetic distance and the original distance
cor(res.dist, res.coph.ward)
## [1] 0.5958253
### complete method
res.hc.com <- hclust(d = res.dist, method = "complete")
# compute cophenetic distance
res.coph.com <- cophenetic(res.hc.com)
# correlation between cophenetic distance and the original distance
cor(res.dist, res.coph.com)
## [1] 0.7671059
### average method
res.hc.avg <- hclust(d = res.dist, method = "average")
# compute cophenetic distance
res.coph.avg <- cophenetic(res.hc.avg)
# correlation between cophenetic distance and the original distance
cor(res.dist, res.coph.avg)
## [1] 0.8496363
To determine the optimal number of clusters, we use 3 approaches: the elbow, silhouette and gap statistic methods. The elbow and gap statistic methods suggest 3 clusters, while the silhouette method suggests 2. Although 3 clusters agrees with the stability measures from our algorithm selection procedure, we will ultimately consider both 2 and 3 clusters and see which produces the better clustering result.
# elbow plot
# 3 clusters suggested
fviz_nbclust(df2, hcut, method = "wss") +
geom_vline(xintercept = 3, linetype = 2)+
labs(subtitle = "Elbow method")
# silhouette method
# 2 clusters suggested
fviz_nbclust(df2, hcut, method = "silhouette") +
labs(subtitle = "Silhouette method") +
theme_classic()
# gap statistic method
# 3 clusters suggested
fviz_nbclust(df2, hcut, method = "gap_stat",
nboot = 50, maxSE = list(method = 'Tibs2001SEmax',
SE.factor = 1)) +
labs(subtitle = "Gap statistic method")
# compute agglomerative hierarchical clustering
res.agnes <- agnes(x = df2, # data matrix
metric = "euclidean", # metric for distance matrix
method = "average" # Linkage method
)
# dendrogram
fviz_dend(res.agnes, show_labels = FALSE,
palette = "jco", as.ggplot = TRUE)
## Registered S3 method overwritten by 'dendextend':
## method from
## text.pvclust pvclust
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
As shown in the dendrogram, there are relatively large cophenetic distances between some objects. If we cut near the top of the tree, we get clusters with very few data points. This is likely because outliers in the data set produce large distances between clusters.
# cut the tree into 2 groups
grp <- cutree(res.agnes, k = 2)
fviz_cluster(list(data = df2, cluster = grp),
ellipse.type = "convex", geom = "point",
repel = TRUE, # avoid label overplotting (slow)
show.clust.cent = FALSE, ggtheme = theme_minimal())
# cut the tree into 3 groups
grp1 <- cutree(res.agnes, k = 3)
fviz_cluster(list(data = df2, cluster = grp1),
ellipse.type = "convex", geom = "point",
repel = TRUE, # avoid label overplotting (slow)
show.clust.cent = FALSE, ggtheme = theme_minimal())
From the cluster plot, the clusters overlap. Cluster 1 contains the majority of the data points and spans a wide range of the plot, while the other clusters have very few data points, all lying within the range of cluster 1. Again, this likely results from the outliers in the data set.
### plot the silhouette for each observation
sil <- silhouette(grp, dist(df2))
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 2215 0.8
## 2 2 1 0.0
# transpose the data set to perform the clustering on the variables
df.t <- t(df2)
set.seed(123)
ss <- sample(1:2034, 30) # extract 30 samples out of 2034
df3 <- df.t[, ss]
### Create a parallel socket cluster
set.seed(123)
cl <- makeCluster(2, type = "PSOCK")
### parallel version of pvclust
res.pv <- parPvclust(cl, df3, nboot=1000)
## Warning in parPvclust(cl, df3, nboot = 1000): "parPvclust" has been integrated into pvclust (with "parallel" option).
## It is available for back compatibility but will be unavailable in the future.
## Multiscale bootstrap... Done.
stopCluster(cl)
# Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)
Values on the dendrogram are AU p-values (red, left), BP values (green, right) and cluster labels (grey, bottom). Clusters with AU >= 95% are indicated by the rectangles and are considered to be strongly supported by the data.
From the plot, once hierarchical clustering is computed on the bootstrap samples, some meaningful and strongly supported clusters appear, because the sampling process reduces the impact of the outliers.
Overall, with outliers present in our data set, hierarchical clustering is not suggested as an appropriate method.
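As an aside, the strongly supported clusters that pvrect() draws boxes around can also be extracted programmatically with pvclust's pvpick(). A small self-contained sketch on toy data (the matrix `m` is hypothetical; with this report's objects one would call pvpick(res.pv)):

```r
library(pvclust)
set.seed(123)
# toy data: pvclust clusters the *columns* of the matrix
m <- matrix(rnorm(200), nrow = 20)   # 20 rows, 10 columns to cluster
fit <- pvclust(m, method.hclust = "average", nboot = 100)
# clusters with AU p-value >= alpha, i.e. the ones pvrect() boxes
picked <- pvpick(fit, alpha = 0.95, pv = "au")
picked$clusters
```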
Similar to the procedure for hierarchical clustering, we use the elbow, silhouette and gap statistic methods to determine the optimal number of clusters for CLARA. The elbow and gap statistic methods suggest 3 clusters, while the silhouette method suggests 2.
Recall that in our cluster algorithm selection, the FOM value suggested 6 as the optimal number of clusters for the CLARA algorithm. FOM has many limitations, among them a lack of robustness, susceptibility to noise and the assumption that the data are uniformly distributed. Since our dataset contains extreme outliers and is not uniformly distributed, the FOM value is not a good guide to the optimal number of clusters.
In short, moving forward we will consider both 2 and 3 clusters for the CLARA algorithm.
# elbow plot
# 3 clusters suggested
fviz_nbclust(df2, clara, method = "wss") +
geom_vline(xintercept = 3, linetype = 2)+
labs(subtitle = "Elbow method")
# silhouette method
# 2 clusters suggested
fviz_nbclust(df2, clara, method = "silhouette")+
labs(subtitle = "Silhouette method") + theme_classic()
# gap statistic method
# 3 clusters suggested
fviz_nbclust(df2, clara, method = "gap_stat", nboot = 50)+
labs(subtitle = "Gap statistic method")
## CLARA clustering
# save the numbers of clusters to test in a list
k_list <- list(2, 3)
# empty list for clara objects
clara_list <- list()
# empty list for datasets with clustering labels
dfclusters_list <- list()
#empty list for medoids objects (dataframes)
medoids_list <- list()
# loop through list of numbers of clusters
for (i in seq_along(k_list)) {
n_cluster <- k_list[[i]]
# compute clara cluster for a specified number of clusters
# save clara object to empty list
clara.res <- clara(df2, n_cluster, samples = 200, pamLike = TRUE)
clara_list[[i]] <- clara.res
# save the medoids to the empty list
medoids_list[[i]] <- clara.res$medoids
# add clustering labels to dataset
dfclusters <- cbind(df2, cluster = clara.res$cluster)
dfclusters_list[[i]] <- dfclusters
}
# example: print results for CLARA with 2 clusters
# can do the same for CLARA with 3 clusters by changing the index from 1 to 2
clara_list[[1]]
## Call: clara(x = df2, k = n_cluster, samples = 200, pamLike = TRUE)
## Medoids:
## Income MntWines MntFruits MntMeatProducts MntFishProducts
## 120 0.7277119 0.2250285 0.04131167 0.2140332 0.8467690
## 271 -0.6596830 -0.7176744 -0.58692506 -0.4458466 -0.5413061
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 120 0.02364428 -0.4432128 -0.1681932 0.6985572
## 271 -0.43895746 -0.3081178 -0.1681932 -0.3959534
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Age
## 120 0.7957578 0.9840723 -0.9561652 0.06844883
## 271 -0.5709535 -0.5540178 0.6930755 -0.26528627
## days_joined
## 120 0.137690499
## 271 -0.008566125
## Objective function: 2.943866
## Clustering vector: Named int [1:2216] 1 1 2 2 2 1 1 2 1 1 1 2 1 2 1 2 2 2 ...
## - attr(*, "names")= chr [1:2216] "1" "2" "3" "4" "5" "6" "7" ...
## Cluster sizes: 969 1247
## Best sample:
## [1] 120 133 143 167 197 229 234 271 310 390 495 560 572 661 663
## [16] 680 701 724 747 751 773 783 839 842 936 943 956 960 986 1029
## [31] 1164 1230 1349 1447 1488 1563 1598 1723 1767 1783 1840 1972 2102 2228
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
# CLARA medoids for 2 clusters
medoids_list[[1]]
## Income MntWines MntFruits MntMeatProducts MntFishProducts
## 120 0.7277119 0.2250285 0.04131167 0.2140332 0.8467690
## 271 -0.6596830 -0.7176744 -0.58692506 -0.4458466 -0.5413061
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 120 0.02364428 -0.4432128 -0.1681932 0.6985572
## 271 -0.43895746 -0.3081178 -0.1681932 -0.3959534
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Age
## 120 0.7957578 0.9840723 -0.9561652 0.06844883
## 271 -0.5709535 -0.5540178 0.6930755 -0.26528627
## days_joined
## 120 0.137690499
## 271 -0.008566125
# Dataset with clustering labels for 2 clusters
head(dfclusters_list[[1]],3)
## Income MntWines MntFruits MntMeatProducts MntFishProducts
## 1 1.2945477 -0.3441506 1.9511513 0.9452513 1.3399009
## 2 0.1924178 0.4710799 -0.5366661 -0.4592226 -0.5595702
## 3 0.5966592 -0.5071967 -0.3858893 -0.4815158 -0.4134571
## MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1 3.9435854 3.3587447 -0.6880206 -0.0311165
## 2 -0.6580846 -0.1344243 -0.6880206 1.0633941
## 3 -0.6093897 -0.2695193 -0.6880206 -0.3959534
## NumCatalogPurchases NumStorePurchases NumWebVisitsMonth Age
## 1 0.4540800 0.06121821 -1.7807855 -0.09841872
## 2 0.1124021 0.36883623 -0.1315448 0.65248524
## 3 -0.2292757 -0.24639982 -1.3684753 0.90278656
## days_joined cluster
## 1 -1.458227 1
## 2 -1.453926 1
## 3 -1.311971 2
for (i in seq_along(k_list)) {
p <- fviz_cluster(clara_list[[i]],
ellipse.type = "t", # Concentration ellipse
geom = "point", pointsize = 1,
ggtheme = theme_classic())
print(p)
}
Silhouette (Si) coefficient: the average silhouette width measures how similar an object i is to the other objects in its own cluster versus those in the nearest neighboring cluster. Si values range from -1 to 1: values close to 1 mean the object is well matched to its own cluster, values near 0 mean it lies between two clusters, and negative values suggest it may have been assigned to the wrong cluster.
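To make the definition concrete, here is a standalone toy sketch on simulated data (separate from the superstore analysis), using the same cluster package:

```r
library(cluster)

# Standalone toy example: two well-separated 2-D blobs, so every
# observation should sit much closer to its own cluster than to the other.
set.seed(42)
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(20, mean = 5, sd = 0.5), ncol = 2))
labels <- rep(1:2, each = 10)

# For each observation i: s_i = (b_i - a_i) / max(a_i, b_i), where a_i is
# the mean distance to its own cluster and b_i to the nearest other cluster
sil <- silhouette(labels, dist(toy))
summary(sil)$avg.width # close to 1 for compact, well-separated clusters
```

By comparison, the average Si values for CLARA on this dataset (around 0.25-0.29) are far from this well-separated benchmark.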
The CLARA procedures with 2 clusters and 3 clusters have similar average Si coefficients (0.29 and 0.25, respectively). Additionally, in both cases the average Si coefficient of the smallest cluster is around 0.11 - 0.13, while that of the largest cluster is around 0.34 - 0.39. In other words, the 2-cluster and 3-cluster procedures produce equally weak clusterings: the Si coefficients in both cases are relatively close to 0, indicating that the cluster structure is barely better than random assignment.
# plot the silhouette for each observation
for (i in seq_along(k_list)) {
p <- fviz_silhouette(clara_list[[i]], palette = "jco",
ggtheme = theme_classic())
print(p)
}
## cluster size ave.sil.width
## 1 1 17 0.13
## 2 2 27 0.39
## cluster size ave.sil.width
## 1 1 10 0.18
## 2 2 12 0.11
## 3 3 24 0.34
# silhouette information
silinfo_2clust <- clara_list[[1]]$silinfo
silinfo_3clust <- clara_list[[2]]$silinfo
# silhouette widths of each observation
sil_2clust <- silinfo_2clust$widths[, 1:3]
sil_3clust <- silinfo_3clust$widths[, 1:3]
# average silhouette width of each cluster
silinfo_2clust$clus.avg.widths # 2 clusters
## [1] 0.1337040 0.3875908
silinfo_3clust$clus.avg.widths # 3 clusters
## [1] 0.1752261 0.1144386 0.3427295
# the total average (mean of all individual silhouette widths)
silinfo_2clust$avg.width # 2 clusters
## [1] 0.2894982
silinfo_3clust$avg.width # 3 clusters
## [1] 0.2467616
# the size of each cluster
clara_list[[1]]$clusinfo # 2 clusters
## size max_diss av_diss isolation
## [1,] 969 23.982855 3.944294 6.389860
## [2,] 1247 9.013688 2.166469 2.401557
clara_list[[2]]$clusinfo # 3 clusters
## size max_diss av_diss isolation
## [1,] 546 24.172099 4.094751 5.042084
## [2,] 640 9.821765 3.080844 3.446752
## [3,] 1030 8.982295 1.956304 3.152156
### objects with negative silhouette
# 2 clusters
neg_sil_index_2clust <- which(sil_2clust[, 'sil_width'] < 0)
sil_2clust[neg_sil_index_2clust, , drop = FALSE]
## cluster neighbor sil_width
## 724 1 2 -0.003533189
## 1488 1 2 -0.045764156
## 751 1 2 -0.057225226
# 3 clusters
neg_sil_index_3clust <- which(sil_3clust[, 'sil_width'] < 0)
sil_3clust[neg_sil_index_3clust, , drop = FALSE]
## cluster neighbor sil_width
## 1439 1 2 -0.13986700
## 954 2 1 -0.04452003
## 199 2 3 -0.09428869
Dunn Index: reflects the compactness and separation of clusters, calculated as the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. It ranges from 0 to infinity. A higher Dunn index indicates better clustering, with clusters that are compact and well separated; a lower value indicates poor clustering, with clusters that overlap or are too spread out.
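As a standalone illustration of the definition (simulated data, separate from the analysis; it assumes the fpc package used for cluster.stats below):

```r
# Standalone sketch: Dunn = (min distance between points of different
# clusters) / (max distance between points of the same cluster), so two
# compact, well-separated blobs score well above 0.
set.seed(123)
toy <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2))
labels <- rep(1:2, each = 10)
dunn_toy <- fpc::cluster.stats(d = dist(toy), clustering = labels)$dunn
dunn_toy # well above 0 for this toy configuration
```

This benchmark makes the near-zero Dunn values obtained for CLARA on the superstore data easier to read.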
From the results below, the CLARA algorithm with 2 clusters and with 3 clusters produces approximately the same Dunn Index values, both very close to 0, indicating poor clustering in both cases.
for (i in seq_along(k_list)) {
# statistics for CLARA clustering
clara_stats <- cluster.stats(d = dist(df2),
clara_list[[i]]$cluster)
# Dunn's index
cat(k_list[[i]], "clusters - Dunn Index:", clara_stats$dunn, "\n")
}
## 2 clusters - Dunn Index: 0.02927206
## 3 clusters - Dunn Index: 0.02609056
In external validation, the result of CLARA clustering is compared with the true labels of the dataset (i.e. Response - 0/1) to assess whether the clustering matches the data structure. The corrected Rand index and Meila's variation of information (VI) are used to quantify the agreement between the CLARA clustering and the true labels. The corrected Rand index ranges from about -1 to 1, where 1 indicates perfect agreement and values near 0 indicate agreement no better than chance; Meila's VI is a distance measure, where 0 indicates perfect agreement and larger values indicate greater disagreement.
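As a standalone sanity check of the two indices on toy partitions (separate from the analysis; it uses the same fpc::cluster.stats function as the code below):

```r
# Standalone check: the corrected Rand index and VI compare partitions,
# not label names, so an identical partition with swapped labels still
# agrees perfectly.
set.seed(1)
truth <- rep(1:2, each = 5)
flip  <- rep(2:1, each = 5) # same partition, label names swapped
d_toy <- dist(matrix(rnorm(20), ncol = 2))
cmp <- fpc::cluster.stats(d = d_toy, clustering = truth,
                          alt.clustering = flip, compareonly = TRUE)
cmp$corrected.rand # 1: perfect agreement
cmp$vi             # 0: no variation of information between the partitions
```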
From the cross-tabulation between the Response labels and the CLARA clusters, a few points can be noted about the distribution profile:
When considering 2 clusters, 40.6% of the response 0 (refused the offer) falls into cluster 1 and 59.4% into cluster 2, while 61.6% of the response 1 (accepted the offer) falls into cluster 1 and 38.4% into cluster 2. Both classes split roughly 40/60 across the two clusters, but in opposite directions.
When considering 3 clusters, nearly half of the response 0 (refused the offer) is classified into cluster 3 (49.5%), with smaller shares in clusters 1 and 2 (21.2% and 29.3%). Conversely, the largest share of the response 1 (accepted the offer) falls into cluster 1 (44.1%), while clusters 2 and 3 hold about the same proportion (26.7% and 29.1%).
The corrected Rand indices show that the agreement between the Response labels and the CLARA clustering is 0.03 for 2 clusters and 0.04 for 3 clusters. Both values are very close to 0, so the clustering agrees with the labels no better than chance. Similarly, Meila's VI is 1.09 for 2 clusters and 1.45 for 3 clusters; since VI equals 0 only for perfect agreement, these values likewise indicate a weak match between the clusters and the Response labels.
### cross-tabulation between Response labels and CLARA with 2 clusters
tab_2clust <- table(super$Response, clara_list[[1]]$cluster)
dimnames(tab_2clust) <- list(Response = c(0, 1), Cluster = c(1, 2))
# count
tab_2clust
## Cluster
## Response 1 2
## 0 764 1119
## 1 205 128
# percentage
prop.table(tab_2clust, margin = 1)*100
## Cluster
## Response 1 2
## 0 40.57355 59.42645
## 1 61.56156 38.43844
### cross-tabulation between Response labels and CLARA with 3 clusters
tab_3clust <- table(super$Response, clara_list[[2]]$cluster)
dimnames(tab_3clust) <- list(Response = c(0, 1), Cluster = c(1, 2, 3))
# count
tab_3clust
## Cluster
## Response 1 2 3
## 0 399 551 933
## 1 147 89 97
# percentage
prop.table(tab_3clust, margin = 1)*100
## Cluster
## Response 1 2 3
## 0 21.18959 29.26182 49.54859
## 1 44.14414 26.72673 29.12913
# compute cluster stats
# CLARA with 2 clusters
stats_2clust <- cluster.stats(d = dist(df2),
super$Response,
clara_list[[1]]$cluster)
## Warning in cluster.stats(d = dist(df2), super$Response, clara_list[[1]]
## $cluster): clustering renumbered because maximum != number of clusters
# CLARA with 3 clusters
stats_3clust <- cluster.stats(d = dist(df2),
super$Response,
clara_list[[2]]$cluster)
## Warning in cluster.stats(d = dist(df2), super$Response, clara_list[[2]]
## $cluster): clustering renumbered because maximum != number of clusters
# corrected Rand Index
cat("\n2 clusters - corrected Rand Index:", stats_2clust$corrected.rand)
##
## 2 clusters - corrected Rand Index: 0.03031754
cat("\n3 clusters - corrected Rand Index:", stats_3clust$corrected.rand)
##
## 3 clusters - corrected Rand Index: 0.03936037
# Meila's variation index VI
cat("\n2 clusters - Meila's variation index VI:", stats_2clust$vi)
##
## 2 clusters - Meila's variation index VI: 1.085724
cat("\n3 clusters - Meila's variation index VI:", stats_3clust$vi)
##
## 3 clusters - Meila's variation index VI: 1.446868
Based on the results of both internal and external validation, the 2-cluster and 3-cluster CLARA procedures achieve an equivalent, and poor, level of clustering quality. Like the algorithms examined earlier, CLARA is therefore not a suitable clustering choice for this dataset.
# Determine the value of epsilon
dbscan::kNNdistplot(df2, k = 4)
abline(h = 3.5, lty = 2)
# DBSCAN
set.seed(123)
db <- fpc::dbscan(df2, eps = 3.5, MinPts = 4)
# plot DBSCAN results
fviz_cluster(db, data = df2, stand = FALSE,
ellipse = FALSE, show.clust.cent = FALSE,
geom = "point", palette = "jco", ggtheme = theme_classic())
# internal validation - Si coefficient
# plot the silhouette for each observation
sil_db <- silhouette(db$cluster, dist(df2))
fviz_silhouette(sil_db)
## cluster size ave.sil.width
## 0 0 25 -0.23
## 1 1 2187 0.29
## 2 2 4 0.70
# hierarchical k-means clustering
res.hk <- hkmeans(df2, 2) # 2 clusters
res.hk1 <- hkmeans(df2, 3) # 3 clusters
# visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco",
rect = TRUE, rect_border = "jco", rect_fill = TRUE)
# visualize the final clusters
# 2 clusters
fviz_cluster(res.hk,
ellipse.type = "t",
geom = "point", pointsize = 1,
palette = "jco", repel = TRUE,
ggtheme = theme_classic())
# 3 clusters
fviz_cluster(res.hk1,
ellipse.type = "t",
geom = "point", pointsize = 1,
palette = "jco", repel = TRUE,
ggtheme = theme_classic())
# internal validation - Si coefficient
# plot the silhouette for each observation
# 2 clusters
sil_hk <- silhouette(res.hk$cluster, dist(df2))
fviz_silhouette(sil_hk)
## cluster size ave.sil.width
## 1 1 877 0.13
## 2 2 1339 0.47
# 3 clusters
sil_hk1 <- silhouette(res.hk1$cluster, dist(df2))
fviz_silhouette(sil_hk1)
## cluster size ave.sil.width
## 1 1 620 0.10
## 2 2 588 0.09
## 3 3 1008 0.43
# fuzzy clustering
res.fanny <- fanny(df2, 2, memb.exp = 1.2) # 2 clusters
res.fanny1 <- fanny(df2, 3, memb.exp = 1.2) # 3 clusters
# visualize the final clusters
# 2 clusters
fviz_cluster(res.fanny, ellipse.type = "t", repel = TRUE,
palette = "jco", ggtheme = theme_minimal(),
legend = "right",
labelsize = 0)
## Warning: ggrepel: 1756 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# 3 clusters
fviz_cluster(res.fanny1, ellipse.type = "t", repel = TRUE,
palette = "jco", ggtheme = theme_minimal(),
legend = "right",
labelsize = 0)
## Warning: ggrepel: 1757 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
# internal validation - Si coefficient
# plot the silhouette for each observation
# 2 clusters
sil_fanny <- silhouette(res.fanny$cluster, dist(df2))
fviz_silhouette(sil_fanny)
## cluster size ave.sil.width
## 1 1 997 0.10
## 2 2 1219 0.51
# 3 clusters
sil_fanny1 <- silhouette(res.fanny1$cluster, dist(df2))
fviz_silhouette(sil_fanny1)
## cluster size ave.sil.width
## 1 1 625 0.10
## 2 2 625 0.07
## 3 3 966 0.45
cluster <- clara_list[[1]]$cluster # 2 clusters from CLARA above
df_reg <- cbind(super, as.factor(cluster)) # unscaled data, including the Response variable, plus the cluster label
head(df_reg)
## Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits
## 1 Graduation Divorced 84835 0 0 0 189 104
## 2 Graduation Single 57091 0 0 0 464 5
## 3 Graduation Married 67267 0 1 0 134 11
## 4 Graduation Together 32474 1 1 0 10 0
## 5 Graduation Single 21474 1 0 0 6 16
## 6 PhD Single 71691 0 0 0 336 130
## MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
## 1 379 111 189 218
## 2 64 7 0 37
## 3 59 15 2 30
## 4 1 0 0 0
## 5 24 11 0 34
## 6 411 240 32 43
## NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
## 1 1 4 4 6
## 2 1 7 3 7
## 3 1 3 2 5
## 4 1 1 0 2
## 5 2 3 1 2
## 6 1 4 7 5
## NumWebVisitsMonth Response Complain Age days_joined as.factor(cluster)
## 1 1 1 0 45 198.70833 1
## 2 5 1 0 54 199.70833 1
## 3 2 0 0 57 232.70833 2
## 4 7 0 0 48 56.66667 2
## 5 7 1 0 26 149.70833 2
## 6 2 1 0 57 289.70833 1
colnames(df_reg)[colnames(df_reg) == "as.factor(cluster)"] <- "cluster"
str(df_reg)
## 'data.frame': 2216 obs. of 22 variables:
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Together" ...
## $ Income : int 84835 57091 67267 32474 21474 71691 63564 44931 65324 65324 ...
## $ Kidhome : int 0 0 0 1 1 0 0 0 0 0 ...
## $ Teenhome : int 0 0 1 1 0 0 0 1 1 1 ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 10 6 336 769 78 384 384 ...
## $ MntFruits : int 104 5 11 0 16 130 80 0 0 0 ...
## $ MntMeatProducts : int 379 64 59 1 24 411 252 11 102 102 ...
## $ MntFishProducts : int 111 7 15 0 11 240 15 0 21 21 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 34 0 32 32 ...
## $ MntGoldProds : int 218 37 30 0 34 43 65 7 5 5 ...
## $ NumDealsPurchases : int 1 1 1 1 2 1 1 1 3 3 ...
## $ NumWebPurchases : int 4 7 3 1 3 4 10 2 6 6 ...
## $ NumCatalogPurchases: int 4 3 2 0 1 7 10 1 2 2 ...
## $ NumStorePurchases : int 6 7 5 2 2 5 7 3 9 9 ...
## $ NumWebVisitsMonth : int 1 5 2 7 7 2 6 5 4 4 ...
## $ Response : int 1 1 0 0 1 1 1 0 0 0 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : num 45 54 57 48 26 57 61 48 61 61 ...
## $ days_joined : num 198.7 199.7 232.7 56.7 149.7 ...
## $ cluster : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 1 2 1 1 ...
# make this example reproducible
set.seed(1)
# use 70% of the dataset as the training set and 30% as the test set
sample <- sample(c(TRUE, FALSE), nrow(df_reg), replace = TRUE, prob = c(0.7, 0.3))
train.data <- df_reg[sample, ]
test.data <- df_reg[!sample, ]
str(train.data)
## 'data.frame': 1551 obs. of 22 variables:
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status : chr "Divorced" "Single" "Married" "Single" ...
## $ Income : int 84835 57091 67267 21474 44931 65324 65324 81044 62499 67786 ...
## $ Kidhome : int 0 0 0 1 0 0 0 0 1 0 ...
## $ Teenhome : int 0 0 1 0 1 1 1 0 0 0 ...
## $ Recency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MntWines : int 189 464 134 6 78 384 384 450 140 431 ...
## $ MntFruits : int 104 5 11 16 0 0 0 26 4 82 ...
## $ MntMeatProducts : int 379 64 59 24 11 102 102 535 61 441 ...
## $ MntFishProducts : int 111 7 15 11 0 21 21 73 0 80 ...
## $ MntSweetProducts : int 189 0 2 0 0 32 32 98 13 20 ...
## $ MntGoldProds : int 218 37 30 34 7 5 5 26 4 102 ...
## $ NumDealsPurchases : int 1 1 1 2 1 3 3 1 2 1 ...
## $ NumWebPurchases : int 4 7 3 3 2 6 6 5 3 3 ...
## $ NumCatalogPurchases: int 4 3 2 1 1 2 2 6 1 6 ...
## $ NumStorePurchases : int 6 7 5 2 3 9 9 10 6 6 ...
## $ NumWebVisitsMonth : int 1 5 2 7 5 4 4 1 4 1 ...
## $ Response : int 1 1 0 1 0 0 0 0 0 1 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : num 45 54 57 26 48 61 61 68 36 56 ...
## $ days_joined : num 199 200 233 150 348 ...
## $ cluster : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 1 1 2 1 ...
# stepwise logistic regression on all predictors, including the cluster label
model <- glm(Response ~., data = train.data, family = binomial(logit)) %>%
stepAIC(trace = TRUE)
## Start: AIC=982.58
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntFruits + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## Complain + Age + days_joined + cluster
##
## Df Deviance AIC
## - MntFruits 1 920.69 980.69
## - Complain 1 920.79 980.79
## - Age 1 921.08 981.08
## - MntFishProducts 1 922.22 982.22
## - NumDealsPurchases 1 922.27 982.27
## - Income 1 922.48 982.48
## <none> 920.58 982.58
## - MntSweetProducts 1 923.93 983.93
## - Kidhome 1 924.05 984.05
## - cluster 1 924.47 984.47
## - MntGoldProds 1 928.26 988.26
## - NumWebVisitsMonth 1 928.78 988.78
## - NumCatalogPurchases 1 929.87 989.87
## - Education 4 936.27 990.27
## - MntWines 1 931.66 991.66
## - NumWebPurchases 1 936.32 996.32
## - MntMeatProducts 1 941.17 1001.17
## - NumStorePurchases 1 947.89 1007.89
## - Teenhome 1 948.52 1008.52
## - days_joined 1 954.32 1014.32
## - Marital_Status 7 971.86 1019.86
## - Recency 1 1033.06 1093.06
##
## Step: AIC=980.69
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## Complain + Age + days_joined + cluster
##
## Df Deviance AIC
## - Complain 1 920.91 978.91
## - Age 1 921.19 979.19
## - MntFishProducts 1 922.24 980.24
## - NumDealsPurchases 1 922.41 980.41
## - Income 1 922.61 980.61
## <none> 920.69 980.69
## - Kidhome 1 924.15 982.15
## - MntSweetProducts 1 924.30 982.30
## - cluster 1 924.48 982.48
## - MntGoldProds 1 928.60 986.60
## - NumWebVisitsMonth 1 928.86 986.86
## - NumCatalogPurchases 1 930.01 988.01
## - Education 4 936.27 988.27
## - MntWines 1 931.69 989.69
## - NumWebPurchases 1 936.42 994.42
## - MntMeatProducts 1 942.35 1000.35
## - NumStorePurchases 1 948.00 1006.00
## - Teenhome 1 949.03 1007.03
## - days_joined 1 954.41 1012.41
## - Marital_Status 7 972.00 1018.00
## - Recency 1 1033.14 1091.14
##
## Step: AIC=978.91
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## Age + days_joined + cluster
##
## Df Deviance AIC
## - Age 1 921.42 977.42
## - MntFishProducts 1 922.41 978.41
## - NumDealsPurchases 1 922.59 978.59
## - Income 1 922.83 978.83
## <none> 920.91 978.91
## - Kidhome 1 924.28 980.28
## - MntSweetProducts 1 924.49 980.49
## - cluster 1 924.71 980.71
## - MntGoldProds 1 928.85 984.85
## - NumWebVisitsMonth 1 929.03 985.03
## - NumCatalogPurchases 1 930.12 986.12
## - Education 4 936.57 986.57
## - MntWines 1 932.04 988.04
## - NumWebPurchases 1 936.57 992.57
## - MntMeatProducts 1 942.51 998.51
## - NumStorePurchases 1 948.48 1004.48
## - Teenhome 1 949.30 1005.30
## - days_joined 1 954.51 1010.51
## - Marital_Status 7 972.03 1016.03
## - Recency 1 1033.41 1089.41
##
## Step: AIC=977.42
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## days_joined + cluster
##
## Df Deviance AIC
## - MntFishProducts 1 922.86 976.86
## - NumDealsPurchases 1 923.09 977.09
## - Income 1 923.38 977.38
## <none> 921.42 977.42
## - Kidhome 1 924.56 978.56
## - MntSweetProducts 1 924.95 978.95
## - cluster 1 925.06 979.06
## - MntGoldProds 1 929.31 983.31
## - NumWebVisitsMonth 1 929.44 983.44
## - NumCatalogPurchases 1 930.75 984.75
## - Education 4 937.99 985.99
## - MntWines 1 932.55 986.55
## - NumWebPurchases 1 937.14 991.14
## - MntMeatProducts 1 942.70 996.70
## - NumStorePurchases 1 949.45 1003.45
## - Teenhome 1 949.77 1003.77
## - days_joined 1 954.69 1008.69
## - Marital_Status 7 973.03 1015.03
## - Recency 1 1033.66 1087.66
##
## Step: AIC=976.86
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntSweetProducts +
## MntGoldProds + NumDealsPurchases + NumWebPurchases + NumCatalogPurchases +
## NumStorePurchases + NumWebVisitsMonth + days_joined + cluster
##
## Df Deviance AIC
## - NumDealsPurchases 1 924.37 976.37
## <none> 922.86 976.86
## - Income 1 924.91 976.91
## - MntSweetProducts 1 925.58 977.58
## - Kidhome 1 925.95 977.95
## - cluster 1 927.44 979.44
## - MntGoldProds 1 929.84 981.84
## - NumWebVisitsMonth 1 931.50 983.50
## - NumCatalogPurchases 1 931.93 983.93
## - Education 4 940.26 986.26
## - MntWines 1 935.00 987.00
## - NumWebPurchases 1 938.48 990.48
## - MntMeatProducts 1 943.05 995.05
## - NumStorePurchases 1 950.88 1002.88
## - Teenhome 1 950.99 1002.99
## - days_joined 1 955.37 1007.37
## - Marital_Status 7 974.49 1014.49
## - Recency 1 1034.44 1086.44
##
## Step: AIC=976.37
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntSweetProducts +
## MntGoldProds + NumWebPurchases + NumCatalogPurchases + NumStorePurchases +
## NumWebVisitsMonth + days_joined + cluster
##
## Df Deviance AIC
## <none> 924.37 976.37
## - Kidhome 1 926.44 976.44
## - Income 1 926.51 976.51
## - MntSweetProducts 1 927.16 977.16
## - cluster 1 928.98 978.98
## - MntGoldProds 1 930.87 980.87
## - NumWebVisitsMonth 1 931.81 981.81
## - NumCatalogPurchases 1 932.67 982.67
## - Education 4 941.68 985.68
## - MntWines 1 936.66 986.66
## - NumWebPurchases 1 938.83 988.83
## - MntMeatProducts 1 943.88 993.88
## - NumStorePurchases 1 954.26 1004.26
## - days_joined 1 955.75 1005.75
## - Marital_Status 7 975.52 1013.52
## - Teenhome 1 964.16 1014.16
## - Recency 1 1034.60 1084.60
summary(model)
##
## Call:
## glm(formula = Response ~ Education + Marital_Status + Income +
## Kidhome + Teenhome + Recency + MntWines + MntMeatProducts +
## MntSweetProducts + MntGoldProds + NumWebPurchases + NumCatalogPurchases +
## NumStorePurchases + NumWebVisitsMonth + days_joined + cluster,
## family = binomial(logit), data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4435 -0.4787 -0.2679 -0.1174 2.8974
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.393e+00 1.652e+00 -1.449 0.147436
## EducationBasic -1.100e+00 8.170e-01 -1.347 0.178132
## EducationGraduation 9.478e-03 3.163e-01 0.030 0.976098
## EducationMaster 2.977e-01 3.632e-01 0.820 0.412424
## EducationPhD 8.041e-01 3.458e-01 2.325 0.020063 *
## Marital_StatusAlone -1.469e+01 6.044e+02 -0.024 0.980604
## Marital_StatusDivorced -1.146e+00 1.559e+00 -0.735 0.462256
## Marital_StatusMarried -2.264e+00 1.551e+00 -1.460 0.144348
## Marital_StatusSingle -1.177e+00 1.554e+00 -0.758 0.448620
## Marital_StatusTogether -2.491e+00 1.559e+00 -1.598 0.110005
## Marital_StatusWidow -1.208e+00 1.587e+00 -0.761 0.446649
## Marital_StatusYOLO 1.267e+01 8.827e+02 0.014 0.988548
## Income 4.814e-06 3.170e-06 1.519 0.128846
## Kidhome 3.235e-01 2.247e-01 1.440 0.149848
## Teenhome -1.200e+00 2.004e-01 -5.987 2.14e-09 ***
## Recency -3.203e-02 3.334e-03 -9.605 < 2e-16 ***
## MntWines 1.312e-03 3.766e-04 3.484 0.000495 ***
## MntMeatProducts 2.345e-03 5.379e-04 4.360 1.30e-05 ***
## MntSweetProducts 4.028e-03 2.398e-03 1.679 0.093072 .
## MntGoldProds 4.531e-03 1.763e-03 2.570 0.010176 *
## NumWebPurchases 1.594e-01 4.254e-02 3.748 0.000178 ***
## NumCatalogPurchases 1.331e-01 4.693e-02 2.836 0.004571 **
## NumStorePurchases -2.121e-01 4.018e-02 -5.278 1.31e-07 ***
## NumWebVisitsMonth 1.471e-01 5.253e-02 2.800 0.005110 **
## days_joined 2.263e-03 4.146e-04 5.458 4.82e-08 ***
## cluster2 8.309e-01 3.886e-01 2.138 0.032504 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1336.47 on 1550 degrees of freedom
## Residual deviance: 924.37 on 1525 degrees of freedom
## AIC: 976.37
##
## Number of Fisher Scoring iterations: 13
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
# Model accuracy
mean(predicted.classes==test.data$Response)
## [1] 0.8586466
confusion_matrix <- table(test.data$Response, predicted.classes)
confusion_matrix
## predicted.classes
## 0 1
## 0 536 36
## 1 58 35
confusionMatrix(confusion_matrix, positive = "1")
## Confusion Matrix and Statistics
##
## predicted.classes
## 0 1
## 0 536 36
## 1 58 35
##
## Accuracy : 0.8586
## 95% CI : (0.8298, 0.8842)
## No Information Rate : 0.8932
## P-Value [Acc > NIR] : 0.99775
##
## Kappa : 0.3479
##
## Mcnemar's Test P-Value : 0.03031
##
## Sensitivity : 0.49296
## Specificity : 0.90236
## Pos Pred Value : 0.37634
## Neg Pred Value : 0.93706
## Prevalence : 0.10677
## Detection Rate : 0.05263
## Detection Prevalence : 0.13985
## Balanced Accuracy : 0.69766
##
## 'Positive' Class : 1
##
We want to compare the accuracy of the regression model with and without the clustering label. Therefore, we fit the same stepwise logistic regression model to the same dataset, this time excluding the cluster label (column 22).
# model without the cluster label (column 22)
model2 <- glm(Response ~., data = train.data[-22], family = binomial(logit)) %>%
stepAIC(trace = TRUE)
## Start: AIC=984.47
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntFruits + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## Complain + Age + days_joined
##
## Df Deviance AIC
## - MntFruits 1 924.48 982.48
## - Complain 1 924.70 982.70
## - Age 1 924.80 982.80
## - Income 1 925.50 983.50
## - NumDealsPurchases 1 926.26 984.26
## <none> 924.47 984.47
## - MntFishProducts 1 926.98 984.98
## - MntSweetProducts 1 927.12 985.12
## - Kidhome 1 928.72 986.72
## - NumCatalogPurchases 1 931.57 989.57
## - MntGoldProds 1 932.24 990.24
## - MntWines 1 932.81 990.81
## - Education 4 940.02 992.02
## - NumWebVisitsMonth 1 934.99 992.99
## - NumWebPurchases 1 937.74 995.74
## - MntMeatProducts 1 943.60 1001.60
## - Teenhome 1 951.42 1009.42
## - days_joined 1 958.64 1016.64
## - NumStorePurchases 1 960.78 1018.78
## - Marital_Status 7 975.34 1021.34
## - Recency 1 1035.75 1093.75
##
## Step: AIC=982.48
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## Complain + Age + days_joined
##
## Df Deviance AIC
## - Complain 1 924.71 980.71
## - Age 1 924.82 980.82
## - Income 1 925.52 981.52
## - NumDealsPurchases 1 926.29 982.29
## <none> 924.48 982.48
## - MntFishProducts 1 927.00 983.00
## - MntSweetProducts 1 927.29 983.29
## - Kidhome 1 928.73 984.73
## - NumCatalogPurchases 1 931.62 987.62
## - MntGoldProds 1 932.38 988.38
## - MntWines 1 932.81 988.81
## - Education 4 940.05 990.05
## - NumWebVisitsMonth 1 934.99 990.99
## - NumWebPurchases 1 937.77 993.77
## - MntMeatProducts 1 944.42 1000.42
## - Teenhome 1 951.69 1007.69
## - days_joined 1 958.65 1014.65
## - NumStorePurchases 1 960.78 1016.78
## - Marital_Status 7 975.38 1019.38
## - Recency 1 1035.78 1091.78
##
## Step: AIC=980.71
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## Age + days_joined
##
## Df Deviance AIC
## - Age 1 925.06 979.06
## - Income 1 925.76 979.76
## - NumDealsPurchases 1 926.48 980.48
## <none> 924.71 980.71
## - MntFishProducts 1 927.18 981.18
## - MntSweetProducts 1 927.49 981.49
## - Kidhome 1 928.85 982.85
## - NumCatalogPurchases 1 931.75 985.75
## - MntGoldProds 1 932.64 986.64
## - MntWines 1 933.16 987.16
## - Education 4 940.35 988.35
## - NumWebVisitsMonth 1 935.17 989.17
## - NumWebPurchases 1 937.93 991.93
## - MntMeatProducts 1 944.59 998.59
## - Teenhome 1 951.97 1005.97
## - days_joined 1 958.75 1012.75
## - NumStorePurchases 1 961.35 1015.35
## - Marital_Status 7 975.41 1017.41
## - Recency 1 1036.06 1090.06
##
## Step: AIC=979.06
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## days_joined
##
## Df Deviance AIC
## - Income 1 926.14 978.14
## - NumDealsPurchases 1 926.81 978.81
## <none> 925.06 979.06
## - MntFishProducts 1 927.44 979.44
## - MntSweetProducts 1 927.81 979.81
## - Kidhome 1 928.99 980.99
## - NumCatalogPurchases 1 932.25 984.25
## - MntGoldProds 1 932.95 984.95
## - MntWines 1 933.58 985.58
## - NumWebVisitsMonth 1 935.37 987.37
## - Education 4 941.50 987.50
## - NumWebPurchases 1 938.39 990.39
## - MntMeatProducts 1 944.71 996.71
## - Teenhome 1 952.55 1004.55
## - days_joined 1 958.84 1010.84
## - NumStorePurchases 1 961.99 1013.99
## - Marital_Status 7 976.14 1016.14
## - Recency 1 1036.22 1088.22
##
## Step: AIC=978.14
## Response ~ Education + Marital_Status + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases +
## NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth +
## days_joined
##
## Df Deviance AIC
## - NumDealsPurchases 1 927.94 977.94
## <none> 926.14 978.14
## - MntFishProducts 1 928.51 978.51
## - MntSweetProducts 1 929.04 979.04
## - Kidhome 1 930.29 980.29
## - NumCatalogPurchases 1 933.54 983.54
## - MntGoldProds 1 933.96 983.96
## - NumWebVisitsMonth 1 935.51 985.51
## - MntWines 1 935.51 985.51
## - Education 4 942.86 986.86
## - NumWebPurchases 1 940.22 990.22
## - MntMeatProducts 1 946.52 996.52
## - Teenhome 1 953.03 1003.03
## - days_joined 1 959.98 1009.98
## - NumStorePurchases 1 962.48 1012.48
## - Marital_Status 7 976.90 1014.90
## - Recency 1 1037.82 1087.82
##
## Step: AIC=977.94
## Response ~ Education + Marital_Status + Kidhome + Teenhome +
## Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumWebPurchases + NumCatalogPurchases +
## NumStorePurchases + NumWebVisitsMonth + days_joined
##
## Df Deviance AIC
## <none> 927.94 977.94
## - MntFishProducts 1 930.11 978.11
## - Kidhome 1 930.80 978.80
## - MntSweetProducts 1 930.84 978.84
## - NumCatalogPurchases 1 934.63 982.63
## - MntGoldProds 1 935.18 983.18
## - NumWebVisitsMonth 1 935.89 983.89
## - MntWines 1 937.48 985.48
## - Education 4 944.60 986.60
## - NumWebPurchases 1 940.71 988.71
## - MntMeatProducts 1 947.58 995.58
## - days_joined 1 960.43 1008.43
## - Marital_Status 7 978.18 1014.18
## - Teenhome 1 966.48 1014.48
## - NumStorePurchases 1 966.74 1014.74
## - Recency 1 1038.03 1086.03
summary(model2)
##
## Call:
## glm(formula = Response ~ Education + Marital_Status + Kidhome +
## Teenhome + Recency + MntWines + MntMeatProducts + MntFishProducts +
## MntSweetProducts + MntGoldProds + NumWebPurchases + NumCatalogPurchases +
## NumStorePurchases + NumWebVisitsMonth + days_joined, family = binomial(logit),
## data = train.data[-22])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3784 -0.4862 -0.2731 -0.1159 2.7199
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.153e+00 1.613e+00 -0.715 0.474656
## EducationBasic -1.218e+00 8.172e-01 -1.491 0.136043
## EducationGraduation -1.992e-02 3.141e-01 -0.063 0.949424
## EducationMaster 2.680e-01 3.624e-01 0.740 0.459603
## EducationPhD 7.433e-01 3.445e-01 2.157 0.030984 *
## Marital_StatusAlone -1.458e+01 6.109e+02 -0.024 0.980963
## Marital_StatusDivorced -1.332e+00 1.549e+00 -0.860 0.389728
## Marital_StatusMarried -2.432e+00 1.541e+00 -1.578 0.114496
## Marital_StatusSingle -1.333e+00 1.543e+00 -0.864 0.387848
## Marital_StatusTogether -2.627e+00 1.548e+00 -1.697 0.089683 .
## Marital_StatusWidow -1.423e+00 1.579e+00 -0.901 0.367658
## Marital_StatusYOLO 1.281e+01 8.827e+02 0.015 0.988421
## Kidhome 3.792e-01 2.239e-01 1.694 0.090299 .
## Teenhome -1.181e+00 2.003e-01 -5.893 3.80e-09 ***
## Recency -3.193e-02 3.324e-03 -9.604 < 2e-16 ***
## MntWines 1.104e-03 3.570e-04 3.092 0.001986 **
## MntMeatProducts 2.360e-03 5.384e-04 4.383 1.17e-05 ***
## MntFishProducts -2.906e-03 1.995e-03 -1.457 0.145150
## MntSweetProducts 4.172e-03 2.439e-03 1.710 0.087224 .
## MntGoldProds 4.865e-03 1.796e-03 2.709 0.006747 **
## NumWebPurchases 1.454e-01 4.115e-02 3.535 0.000408 ***
## NumCatalogPurchases 1.174e-01 4.566e-02 2.570 0.010165 *
## NumStorePurchases -2.300e-01 3.889e-02 -5.914 3.33e-09 ***
## NumWebVisitsMonth 1.451e-01 5.023e-02 2.889 0.003865 **
## days_joined 2.294e-03 4.134e-04 5.549 2.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1336.47 on 1550 degrees of freedom
## Residual deviance: 927.94 on 1526 degrees of freedom
## AIC: 977.94
##
## Number of Fisher Scoring iterations: 13
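Coefficients in the summary above are on the logit scale and are easier to read as odds ratios via `exp()`. A standalone sketch using the printed estimates (rather than the fitted `model2` object):

```r
# Convert selected logit-scale estimates from the summary above to odds ratios.
est <- c(Recency = -3.193e-02, Teenhome = -1.181e+00, days_joined = 2.294e-03)
odds_ratio <- exp(est)
round(odds_ratio, 4)
# Recency: each additional day since the last purchase scales the odds of
# acceptance by about 0.969; Teenhome: each teenager at home scales the odds
# by about 0.307; days_joined: each extra day of tenure scales them by ~1.002.
```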
# Make predictions
probabilities2 <- model2 %>% predict(test.data[-22], type = "response")
predicted.classes2 <- ifelse(probabilities2 > 0.5, 1, 0)
# Model accuracy
mean(predicted.classes2==test.data$Response)
## [1] 0.8661654
confusion_matrix2 <- table(test.data$Response, predicted.classes2)
confusion_matrix2
## predicted.classes2
## 0 1
## 0 540 32
## 1 57 36
confusionMatrix(confusion_matrix2, positive = "1")
## Confusion Matrix and Statistics
##
## predicted.classes2
## 0 1
## 0 540 32
## 1 57 36
##
## Accuracy : 0.8662
## 95% CI : (0.8379, 0.8911)
## No Information Rate : 0.8977
## P-Value [Acc > NIR] : 0.99603
##
## Kappa : 0.3732
##
## Mcnemar's Test P-Value : 0.01096
##
## Sensitivity : 0.52941
## Specificity : 0.90452
## Pos Pred Value : 0.38710
## Neg Pred Value : 0.94406
## Prevalence : 0.10226
## Detection Rate : 0.05414
## Detection Prevalence : 0.13985
## Balanced Accuracy : 0.71697
##
## 'Positive' Class : 1
##
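At the default 0.5 cutoff the model catches only about half of the accepters (sensitivity ≈ 0.53). Lowering the cutoff trades specificity for sensitivity, which may suit a campaign where missing a likely buyer costs more than an extra phone call. A minimal sketch on simulated scores (in the report, `probabilities2` from `predict()` would be used instead):

```r
# Sketch: how the classification cutoff trades sensitivity against specificity.
# Simulated probabilities stand in for the model's predicted scores.
set.seed(1)
y <- rbinom(200, 1, 0.15)                         # imbalanced outcome, like Response
p <- plogis(qlogis(0.15) + 1.5 * y + rnorm(200))  # scores tend higher for true 1s
for (cut in c(0.5, 0.3, 0.15)) {
  pred <- as.integer(p > cut)
  sens <- sum(pred == 1 & y == 1) / sum(y == 1)
  spec <- sum(pred == 0 & y == 0) / sum(y == 0)
  cat(sprintf("cutoff %.2f: sensitivity %.3f, specificity %.3f\n", cut, sens, spec))
}
```

Because lowering the cutoff only adds predicted positives, sensitivity can never decrease as the cutoff falls; the cost is a rising false-positive rate.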
### cross-tabulation between Response and CLARA clusters
tab_response <- table(super$Response, cluster)
# count
tab_response
## cluster
## 1 2
## 0 764 1119
## 1 205 128
# percentage
prop.table(tab_response, margin = 1)*100
## cluster
## 1 2
## 0 40.57355 59.42645
## 1 61.56156 38.43844
# perform chi-square test
chisq.test(tab_response)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab_response
## X-squared = 49.805, df = 1, p-value = 1.698e-12
# create the side-by-side bar graph
barplot(tab_response, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(tab_response),
args.legend = list(title = "Response", x = "top"))
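The Yates-corrected statistic reported above can be reproduced by hand from the contingency table, which makes the test easy to audit. A sketch using the counts printed above:

```r
# Reproduce the Yates-corrected chi-square statistic for the Response x cluster table.
tab <- matrix(c(764, 1119, 205, 128), nrow = 2, byrow = TRUE,
              dimnames = list(Response = c("0", "1"), cluster = c("1", "2")))
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # E_ij = row_i * col_j / n
X2 <- sum((abs(tab - expected) - 0.5)^2 / expected)       # 0.5 = Yates correction
round(X2, 3)  # matches the X-squared = 49.805 reported by chisq.test()
```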
Marital status has several levels (Married, Divorced, Together, Single, and a few rare ones), so we collapse it into two groups: Married / Together in one group and all other statuses in the other.
### cross-tabulation between Marital_Status and CLARA clusters
tab_ms <- table(super$Marital_Status, cluster)
# count
tab_ms
## cluster
## 1 2
## Absurd 2 0
## Alone 0 3
## Divorced 106 126
## Married 367 490
## Single 199 272
## Together 251 322
## Widow 44 32
## YOLO 0 2
# percentage
prop.table(tab_ms, margin = 1)*100
## cluster
## 1 2
## Absurd 100.00000 0.00000
## Alone 0.00000 100.00000
## Divorced 45.68966 54.31034
## Married 42.82380 57.17620
## Single 42.25053 57.74947
## Together 43.80454 56.19546
## Widow 57.89474 42.10526
## YOLO 0.00000 100.00000
### group the data
tab_ms <- table(super$Marital_Status, cluster)
new_row_married <- colSums(tab_ms[c(4,6),])
new_row_others <- colSums(tab_ms[c(1:3,5,7,8),])
new_table_ms <- rbind(new_row_married, new_row_others)
rownames(new_table_ms)[1] <- "Married / Together"
rownames(new_table_ms)[2] <- "Others"
# count
new_table_ms
## 1 2
## Married / Together 618 812
## Others 351 435
# percentage
prop.table(new_table_ms, margin = 1)*100
## 1 2
## Married / Together 43.21678 56.78322
## Others 44.65649 55.34351
From the chi-square test and the side-by-side bar graph, marital status (Married / Together vs. others) appears independent of the cluster assignment: the p-value is 0.54, so we fail to reject the null hypothesis of independence.
# perform chi-square test
chisq.test(new_table_ms)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: new_table_ms
## X-squared = 0.37075, df = 1, p-value = 0.5426
# create the side-by-side bar graph
barplot(new_table_ms, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_ms),
args.legend = list(title = "Marital Status", x = "topleft"))
The marital-status percentages barely differ between the two clusters. For the number of kids at home, we likewise collapse the variable into two groups, no kids and with kids, since what interests us is simply whether there are kids at home rather than the exact count.
### cross-tabulation between Kidhome and CLARA clusters
tab_kid <- table(super$Kidhome, cluster)
# count
tab_kid
## cluster
## 1 2
## 0 885 398
## 1 82 805
## 2 2 44
# percentage
prop.table(tab_kid, margin = 1)*100
## cluster
## 1 2
## 0 68.978956 31.021044
## 1 9.244645 90.755355
## 2 4.347826 95.652174
### group the data
tab_kid <- table(super$Kidhome, cluster)
new_row <- colSums(tab_kid[2:3,])
new_table_kid <- rbind(tab_kid[1,], new_row)
rownames(new_table_kid)[1] <- "No Kids at home"
rownames(new_table_kid)[2] <- "With Kids at home"
# count
new_table_kid
## 1 2
## No Kids at home 885 398
## With Kids at home 84 849
# percentage
prop.table(new_table_kid, margin = 1)*100
## 1 2
## No Kids at home 68.978956 31.02104
## With Kids at home 9.003215 90.99678
In contrast to marital status, the side-by-side bar graph suggests that having kids at home may be associated with the cluster assignment. The chi-square test confirms this, with a p-value well below 0.05.
# perform chi-square test
chisq.test(new_table_kid)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: new_table_kid
## X-squared = 787.22, df = 1, p-value < 2.2e-16
# create the side-by-side bar graph
barplot(new_table_kid, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_kid),
args.legend = list(title = "Kids Status", x = "top"))
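The `colSums()`/`rbind()` row-collapsing pattern above is repeated for each categorical variable; a small helper (`collapse_rows`, a hypothetical function not in the original code) could express it once. A sketch on the Kidhome counts printed above:

```r
# Sketch: collapse a contingency table's rows into two groups,
# generalizing the colSums()/rbind() pattern used for each variable.
collapse_rows <- function(tab, group1_rows, labels) {
  out <- rbind(colSums(tab[group1_rows, , drop = FALSE]),
               colSums(tab[-group1_rows, , drop = FALSE]))
  rownames(out) <- labels
  out
}
tab_kid <- matrix(c(885, 398, 82, 805, 2, 44), nrow = 3, byrow = TRUE,
                  dimnames = list(Kidhome = 0:2, cluster = 1:2))
collapse_rows(tab_kid, 1, c("No Kids at home", "With Kids at home"))
```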
### cross-tabulation between Teenhome and CLARA clusters
tab_teen <- table(super$Teenhome, cluster)
# count
tab_teen
## cluster
## 1 2
## 0 541 606
## 1 404 614
## 2 24 27
# percentage
prop.table(tab_teen, margin = 1)*100
## cluster
## 1 2
## 0 47.16652 52.83348
## 1 39.68566 60.31434
## 2 47.05882 52.94118
### group the data
tab_teen <- table(super$Teenhome, cluster)
new_row <- colSums(tab_teen[2:3,])
new_table_teen <- rbind(tab_teen[1,], new_row)
rownames(new_table_teen)[1] <- "No Teens at home"
rownames(new_table_teen)[2] <- "With Teens at home"
# count
new_table_teen
## 1 2
## No Teens at home 541 606
## With Teens at home 428 641
# percentage
prop.table(new_table_teen, margin = 1)*100
## 1 2
## No Teens at home 47.16652 52.83348
## With Teens at home 40.03742 59.96258
# perform chi-square test
chisq.test(new_table_teen)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: new_table_teen
## X-squared = 11.141, df = 1, p-value = 0.0008446
# create the side-by-side bar graph
barplot(new_table_teen, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_teen),
args.legend = list(title = "Teens Status", x = "top"))
### cross-tabulation between Education and CLARA clusters
tab_education <- table(super$Education, cluster)
# count
tab_education
## cluster
## 1 2
## 2n Cycle 76 124
## Basic 1 53
## Graduation 507 609
## Master 144 221
## PhD 241 240
# percentage
prop.table(tab_education, margin = 1)*100
## cluster
## 1 2
## 2n Cycle 38.000000 62.000000
## Basic 1.851852 98.148148
## Graduation 45.430108 54.569892
## Master 39.452055 60.547945
## PhD 50.103950 49.896050
### group the data
tab_edu <- table(super$Education, cluster)
new_row <- colSums(tab_edu[c(2,3),])
new_row1 <- colSums(tab_edu[c(1, 4:5),])
new_table_edu <- rbind(new_row, new_row1)
rownames(new_table_edu)[1] <- "Basic Education / Graduation"
rownames(new_table_edu)[2] <- "Higher Education"
# count
new_table_edu
## 1 2
## Basic Education / Graduation 508 662
## Higher Education 461 585
# percentage
prop.table(new_table_edu, margin = 1)*100
## 1 2
## Basic Education / Graduation 43.41880 56.58120
## Higher Education 44.07266 55.92734
# perform chi-square test
chisq.test(new_table_edu)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: new_table_edu
## X-squared = 0.07122, df = 1, p-value = 0.7896
# create the side-by-side bar graph
barplot(new_table_edu, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_edu),
args.legend = list(title = "Education", x = "topleft"))
The clustering tendency of the dataset was verified with the Hopkins statistic and visual assessment techniques, comparing it against a randomly generated dataset.
Hierarchical clustering may not be appropriate for this dataset, even though it was the top-recommended algorithm.
Two clusters fit the dataset best; CLARA, hierarchical k-means, and fuzzy k-means are the preferred methods.
Clustering algorithms can reveal structure: by grouping similar data points into clusters, they make meaningful patterns and relationships in the data easier to identify.
To predict the response variable, logistic regression is preferred; the choice between logistic regression and a clustering algorithm depends on the purpose of the study.
The cluster assignment is associated with the number of kids at home.
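The Hopkins statistic mentioned above can be computed directly. With the u/(u+w) convention used here, values near 0.5 suggest spatial randomness and values approaching 1 suggest clustering tendency (package conventions differ, so check the direction before comparing). A hand-rolled sketch on a toy two-cluster dataset:

```r
# Sketch: hand-rolled Hopkins statistic on toy 2-D data with two clear clusters.
set.seed(42)
X <- rbind(matrix(rnorm(100, 0), ncol = 2),   # cluster around (0, 0)
           matrix(rnorm(100, 5), ncol = 2))   # cluster around (5, 5)
m <- 25
nn_dist <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))
idx <- sample(nrow(X), m)
# w: distance from each sampled real point to its nearest other real point
w <- sapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]))
# u: distance from uniform points in the bounding box to the nearest real point
U <- cbind(runif(m, min(X[, 1]), max(X[, 1])),
           runif(m, min(X[, 2]), max(X[, 2])))
u <- apply(U, 1, nn_dist, pts = X)
H <- sum(u) / (sum(u) + sum(w))
H  # well above 0.5 for this clearly clustered toy data
```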